摘要

本研究旨在透過S&P500成分股觀察,選取前四大產業類別之個股

自製投資組合並利用移動窗格最佳化檢驗投資組合是否具有更勝S&P500指數之表現

過程中利用R語言完成網路爬蟲、資料前處理與時間序列格式化、資料視覺化、移動窗格最佳化回測

希望可以透過此篇研究讓金融資料數據化展現

研究流程架構圖

將透過S&P500的產業分析挑選前四大產業個別兩支個股,篩選依據為利用2018/06/19~2020/06/19之股價Mean Log Returns以及Stdev Log Returns作為評斷

組成投資組合後進一步利用2020/06/19~2022/06/19之股價進行移動窗格最佳化(Walk Forward Optimization),

本篇研究設定In-Sample-Data週期為4個月(相關移動窗格法最佳化可參考移動窗格法最佳化欄位說明) 移動窗格法示意圖

最後利用R相關套件去進行此段時間回測

並與大盤以及未最佳化之投組去做績效分析。

移動窗格法最佳化

一般來說回測有2種方法:

第一種方法:把手上所有的歷史數據都拿來做backtesting,像下面的這張圖一樣。用所有的歷史數據(稱之為In-Sample-Data)來跑backtesting(下圖中灰色區域)。跑完之後得到理想的參數和模型,然後就用來實盤(下圖中紅色區域)。

但是這種回測有個問題:過度擬合。往往回測結果不錯,但是實盤就是不好看。

另一種方式,稱之為Walk forward Backtesting,前進式回測。

即:如果有12個星期的歷史資料可以回測,我們先用第1到第4個星期的數據來跑優化,然後將第5個星期的數據模擬實盤。這時候第1-4個星期的數據就是In-Sample-Data,第5個星期的績效就是Out-Of-Sample的績效。

然後我們測試窗口後挪一個星期,用第2-5個星期數據的回測,用第6個星期的數據模擬實盤。這樣重複8次。最後將所有out-of-sample的結果結合在一起(就是圖下面綠色的部分)。就會是這個策略實際真正可能出現的績效performance。

這種方法因為做了許多次的walk forward test(這個例子裡面做了7次),所以我們在真實實盤的時候(就等於是第8次的walk forward test),實盤績效就會特別接近之前7次的模擬實盤績效的結果。

Ref:【学习笔记】前进式回测Walk forward Backtesting.

3.1 使用套件說明

rvest

rvest 套件擷取任意的網頁資料,並將大量結構化的資料直接匯入 R 中,

讓資料分析者可以省去手動整理資料的時間。

dplyr

dplyr() 套件中融入很多概念與結構化查詢語言(Structured Query Language,SQL)相仿的函數。

搭配 %>% 運算子一起使用,能夠讓我們整理資料的能力獲得一個檔次的提升!

DT

可將所有的數據呈現為交互式數據表,可自行選擇每次呈現數量的頻率為何。

ggplot2

以繪圖文法概念為基礎所發展出來的一套 R 繪圖系統,

可繪製各種高品質圖形的 R 繪圖套件,是一套相當受歡迎的繪圖工具。

forcats

此套件中的 fct_reorder() 函數可以將因子 f 類別出現的排列順序依照其他變數更動,

可幫助我們直接排序前四大產業。

3.2 使用套件說明

quantmod

此套件用來做股票相關分析,可以利用此套件做到快速獲取股價資料,

以及簡易的K線圖繪製。

purrr

此套件中的 map() 函數可以將向量資料運用於函數中,並直接返回 list 形式,

map_dbl() 函數可以將向量資料運用於函數中,並直接返回 dbl 形式。

tibble

此套件中的 rownames_to_column() 函數可以協助我們整理資料。

stringr

此套件中的 str_c() 函數可以將多字串合併為單一字串

plotly

plotly 的 R 圖形庫可以製作交互式、互動式的圖形

在此處可以更方便讀者自行操作。

3.3 使用套件說明

PortfolioAnalytics

此套件幫助我們設定投資組合限定與目標條件,

並可以自行設定參數實現移動窗格法最佳化。

PerformanceAnalytics

此套件幫助我們評估投資組合之績效,以及報酬率圖表可視化

讓我們可以有效分析策略是否有效。

一. 研究主題

利用 R 語言分析 S&P500 的產業分析挑選前四大產業個別兩支個股,

篩選依據為利用2018/06/19~2020/06/19之股價Mean Log Returns以及Stdev Log Returns作為評斷,

組成投資組合後進一步利用2020/06/19~2022/06/19之股價進行移動窗格最佳化(Walk Forward),

最後利用R相關套件去進行此段時間回測並與大盤以及未最佳化之投組去做績效分析,檢驗策略是否有效。

二. 研究動機

近來股票市場成為全民關注焦點,其中標準普爾500指數,簡稱S&P500,成為投資人都想要打敗的目標,

而自身在交易的過程中,也時常思考有沒有辦法透過分析股票的歷史數據,去挑選出表現以及成長更好的股票,

在挑選這些股票後,有沒有辦法去好好的管理這些股票的比率,以自製更優於市場表現的基金。

三. 研究方法

分析架構

利用R語言去爬取S&P500的的成分股,並分析成分股的產業組成,

整理資料成需要的形式,並挑選成分股中前四大相關產業,將各產業挑選2支個股去組成投資組合,篩選方法依據

  • 在設定風險條件(sd.log.returns < 0.03)下歷史數據報酬最高的股票

  • 波動度(Variability)最高的股票

並利用R強大的 PortfolioAnalytics 、 PerformanceAnalytics 的投資組合套件,做到 Walk forward optimization (移動窗格法),並在最後以S&P500、移動窗格法產生的最佳化投資組合、和等比例投資組合標的,做出績效的回報,並給予研究結論。

library(rvest)
library(dplyr)
library(DT)
library(ggplot2)
library(forcats)

3.1 S&P500 成分股分析

1.網路爬蟲與整理

# Web-scrape SP500 stock list
sp_500 <- read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies") %>%
  html_node("table.wikitable") %>%
  html_table() %>%
  select("Symbol", Security, "GICS Sector", "GICS Sub-Industry") %>%
  as_tibble()
# Format names
names(sp_500) <- sp_500 %>% 
  names() %>% 
  make.names()
# Show results
datatable(sp_500, options = list(pageLength = 5))

2.統計各類資料數量

# show in condensed format
sp_500 %>% 
  lapply(function(x) x %>% unique() %>% length()) %>%
  unlist() 
##            Symbol          Security       GICS.Sector GICS.Sub.Industry 
##               504               504                11               122

3.S&P500 各產業數量柱狀圖

#Visualiztion industry in SP500
sp_500 %>%
  # Summarise data by frequency
  group_by(GICS.Sector) %>%
  summarise(count = n()) %>%
  # Visualize 
  ggplot(aes(x = GICS.Sector %>% fct_reorder(count),
             y = count
  )) + 
  geom_bar(stat = "identity",fill = c("gray",
                                      "gray",
                                      "gray",
                                      "gray",
                                      "red",
                                      "red",
                                      "red",
                                      "red",
                                      "gray",
                                      "gray",
                                      "gray")) +
  geom_text(aes(label = count), size = 3, nudge_y = 4, nudge_x = .1) + 
  scale_y_continuous(limits = c(0,100)) +
  ggtitle(label = "Sector Frequency Among SP500 Stocks") +
  xlab(label = "GICS Sector") +
  theme(plot.title = element_text(size = 16)) + 
  coord_flip() 

library(quantmod)
library(purrr)
library(tibble)
library(stringr)
library(plotly)

3.2 投資組合篩選

1.選取前四大相關產業

1.Information Technology

2.Industrials

3.Financials

4.Health Care

# 選取前四大相關產業
sp_500_Information_Technology <- sp_500 %>% 
  filter(GICS.Sector=="Information Technology")
sp_500_Industrials <- sp_500 %>% 
  filter(GICS.Sector=="Industrials")
sp_500_Financials <- sp_500 %>% 
  filter(GICS.Sector=="Financials")
sp_500_Health_Care <- sp_500 %>% 
  filter(GICS.Sector=="Health Care")

2.撰寫獲取股價時間序列資料函數(有興趣者可以打開code選項)

# Get Stock Prices Function
get_stock_prices <- function(ticker, return_format = "tibble", ...) {
  # Get stock prices
  stock_prices_xts <- getSymbols(Symbols = ticker, auto.assign = FALSE, ...)
  Time <- index(stock_prices_xts)
  # Rename
  names(stock_prices_xts) <- c("Open", "High", "Low", "Close", "Volume", "Adjusted")
  # Return in xts format if tibble is not specified
  if (return_format == "tibble") {
    stock_prices <- stock_prices_xts %>%
      as_tibble() %>%
      rownames_to_column(var = "Date") %>%
      mutate(Date = Time)
  } else {
    stock_prices <- stock_prices_xts
  }
  stock_prices
}

3.撰寫計算股價 Log return 函數(有興趣者可以打開code選項)

\[R = \frac{ln(\frac{V_f}{V_i})}{t}\\V_i =the\ inital\ value\ of\ the\ investment\\V_f = the\ final\ value\ of\ the\ investment\\t = the\ number\ of\ time\ periods\]

# Get Log Returns Function
get_log_returns <- function(x, return_format = "tibble", period = 'daily', ...) {
  Time <- x$Date
  # Convert tibble to xts
  if (!is.xts(x)) {
    x <- xts(x[,-1], order.by = x$Date)
  }
  # Get log returns
  log_returns_xts <- periodReturn(x = x$Adjusted, type = 'log', period = period, ...)
  # Rename
  names(log_returns_xts) <- "Log.Returns"
  # Return in xts format if tibble is not specified
  if (return_format == "tibble") {
    log_returns <- log_returns_xts %>%
      as_tibble() %>%
      rownames_to_column(var = "Date") %>%
      mutate(Date = Time)
  } else {
    log_returns <- log_returns_xts
  }
  log_returns
}

4.利用3.2.2求取選取產業之股價並利用3.2.3計算股價 Log return

(以下將先以Information Technology Company做示範)

(以2018/06/19~2020/06/19股價表現作為之後篩選依據)

# Mapping the Functions to Information Technology Company
sp_500_Information_Technology <- sp_500_Information_Technology %>%
  mutate(
    stock.prices = map(Symbol, 
                       function(.x) get_stock_prices(.x, 
                                                     return_format = "tibble",
                                                     from = "2018-06-19",
                                                     to = "2020-06-19")
    ),
    log.returns  = map(stock.prices, 
                       function(.x) get_log_returns(.x, return_format = "tibble")),
    mean.log.returns = map_dbl(log.returns, ~ mean(.$Log.Returns)),
    sd.log.returns   = map_dbl(log.returns, ~ sd(.$Log.Returns)),
    n.trade.days = map_dbl(stock.prices, nrow)
  )  

5.資料整理

觀察資料
sp_500_Information_Technology%>% 
  select(Symbol, stock.prices:log.returns)
## # A tibble: 74 × 3
##    Symbol stock.prices       log.returns       
##    <chr>  <list>             <list>            
##  1 ACN    <tibble [504 × 7]> <tibble [504 × 2]>
##  2 ADBE   <tibble [504 × 7]> <tibble [504 × 2]>
##  3 ADP    <tibble [504 × 7]> <tibble [504 × 2]>
##  4 AKAM   <tibble [504 × 7]> <tibble [504 × 2]>
##  5 AMD    <tibble [504 × 7]> <tibble [504 × 2]>
##  6 APH    <tibble [504 × 7]> <tibble [504 × 2]>
##  7 ADI    <tibble [504 × 7]> <tibble [504 × 2]>
##  8 ANSS   <tibble [504 × 7]> <tibble [504 × 2]>
##  9 AAPL   <tibble [504 × 7]> <tibble [504 × 2]>
## 10 AMAT   <tibble [504 × 7]> <tibble [504 × 2]>
## # … with 64 more rows
其中的第一個資料裡的stock.prices的tibble(ACN)
sp_500_Information_Technology$stock.prices[[1]] 
## # A tibble: 504 × 7
##    Date        Open  High   Low Close  Volume Adjusted
##    <date>     <dbl> <dbl> <dbl> <dbl>   <dbl>    <dbl>
##  1 2018-06-19  162.  163.  161.  163. 2574400     155.
##  2 2018-06-20  163   163.  162.  162. 2026800     153.
##  3 2018-06-21  161.  162.  160.  160. 2034200     151.
##  4 2018-06-22  160.  160.  158.  160. 3623900     151.
##  5 2018-06-25  158.  158.  155.  156. 3134500     148.
##  6 2018-06-26  157.  158.  156.  157. 2164000     148.
##  7 2018-06-27  157.  159.  155.  155. 3191900     147.
##  8 2018-06-28  160.  166.  160.  164. 5801100     156.
##  9 2018-06-29  164.  166.  163.  164. 3370000     155.
## 10 2018-07-02  162.  163.  161.  163. 2279400     154.
## # … with 494 more rows
選取我們需要使用的資料
sp_500_Information_Technology %>% 
  select(Symbol, mean.log.returns:n.trade.days)
## # A tibble: 74 × 4
##    Symbol mean.log.returns sd.log.returns n.trade.days
##    <chr>             <dbl>          <dbl>        <dbl>
##  1 ACN            0.000490         0.0195          504
##  2 ADBE           0.00102          0.0246          504
##  3 ADP            0.000264         0.0213          504
##  4 AKAM           0.000415         0.0208          504
##  5 AMD            0.00233          0.0397          504
##  6 APH            0.000216         0.0207          504
##  7 ADI            0.000489         0.0261          504
##  8 ANSS           0.000978         0.0230          504
##  9 AAPL           0.00132          0.0227          504
## 10 AMAT           0.000503         0.0314          504
## # … with 64 more rows

6.將Information Technology Company各股票視覺化: Stock Risk vs Reward

此處圖表為互動式圖表,使用者可以將滑鼠游標放置圖表上,便可查看資料的資訊

plot_ly(data   = sp_500_Information_Technology,
        type   = "scatter",
        mode   = "markers",
        x      = ~ sd.log.returns,
        y      = ~ mean.log.returns,
        color  = ~ n.trade.days,
        colors = "Blues",
        size   = ~ n.trade.days,
        text   = ~ str_c("<em>", Security, "</em><br>",
                         "Ticker: ", Symbol, "<br>",
                         "Sub Sector: ", GICS.Sub.Industry, "<br>",
                         "No. of Trading Days: ", n.trade.days),
        marker = list(opacity = 0.8,
                      symbol = 'circle',
                      sizemode = 'diameter',
                      sizeref = 4.0,
                      line = list(width = 2, color = '#FFFFFF'))
) %>%
  layout(title   = 'S&P500_Information_Technology Analysis: Stock Risk vs Reward',
         xaxis   = list(title = 'Risk/Variability (StDev Log Returns)',
                        gridcolor = 'rgb(255, 255, 255)',
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwidth = 2),
         yaxis   = list(title = 'Reward/Growth (Mean Log Returns)',
                        gridcolor = 'rgb(255, 255, 255)',
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwith = 2),
         margin = list(l = 100,
                       t = 100,
                       b = 100),
         font   = list(color = '#FFFFFF'),
         paper_bgcolor = 'rgb(0, 0, 0)',
         plot_bgcolor = 'rgb(0, 0, 0)')

7.篩選標的,透過以下兩種方法

方法一 : Stock Risk < 0.03 之中報酬率最好的股票
# METHOD1 : sd.log.returns < 0.03 and Highest mean.log.returns 
sp_500_Information_Technology_picked_1 <- sp_500_Information_Technology %>%
                                          filter(mean.log.returns >= 0.001,
                                                 sd.log.returns < 0.03) %>%
                                          select(Symbol, mean.log.returns:n.trade.days) %>%
                                          arrange(mean.log.returns %>% desc())
sp_500_Information_Technology_picked_1[1,]
## # A tibble: 1 × 4
##   Symbol mean.log.returns sd.log.returns n.trade.days
##   <chr>             <dbl>          <dbl>        <dbl>
## 1 NOW             0.00155         0.0283          504
方法二: Stock Risk 最大的股票(高波動度)
# METHOD2 : Highest sd.log.returns 
sp_500_Information_Technology_picked_2 <- sp_500_Information_Technology %>%
                                          select(Symbol, mean.log.returns:n.trade.days) %>%
                                          arrange(sd.log.returns %>% desc())
sp_500_Information_Technology_picked_2[1,]
## # A tibble: 1 × 4
##   Symbol mean.log.returns sd.log.returns n.trade.days
##   <chr>             <dbl>          <dbl>        <dbl>
## 1 ENPH            0.00396         0.0554          504

8.觀察篩選標的表現(2018/06/19~2020/06/19)

# NOW
as.character(sp_500_Information_Technology_picked_1[1,1]) %>%
  getSymbols(auto.assign = FALSE,
             from = "2018-06-19",
             to   = "2020-06-19") %>%
  chartSeries(name = as.character(sp_500_Information_Technology_picked_1[1,1]))

# ENPH
as.character(sp_500_Information_Technology_picked_2[1,1]) %>%
  getSymbols(auto.assign = FALSE,
             from = "2018-06-19",
             to   = "2020-06-19") %>%
  chartSeries(name = as.character(sp_500_Information_Technology_picked_2[1,1]))

可以看出我們篩選出的標的在過去的表現都很不錯

成長率高,且趨勢持續向上

9.重複3.2.4-3.2.8步驟應用於不同產業

(完整程式碼將附於本網頁之最後,有興趣者可以自行參考)

最終我們篩選出四大產業的各兩支股票

1.Information Technology : “NOW” “ENPH”

2.Industrials : “GNRC” “CARR”

3.Financials : “MKTX” “LNC”

4.Health Care : “WST” “MRNA”

results <- c("NOW","ENPH","GNRC","CARR","MKTX","LNC","WST","MRNA")
tickers <- results

3.3 移動窗格法最佳化回測實作

library(PortfolioAnalytics)
library(PerformanceAnalytics)

1.獲取自製投資組合測試時間收盤價(2020-06-19~2022-06-19)

portfolioPrices <- NULL
for (ticker in tickers) {
  portfolioPrices <- cbind(portfolioPrices,
                           getSymbols.yahoo(ticker, from = "2020-06-19",to = "2022-06-19", 
                                            periodicity = "daily", auto.assign = F)[,4])
}
datatable(portfolioPrices, options = list(pageLength = 5))

2.運用PortfolioAnalytics套件設定投資組合

-交易成本 : 0.75 %

-單一個股於投組最多比例 : 20 %

-單一個股於投組最少比例 : 5 %

-投資組合單日波動度 : 0.5 %

portfolioReturns <- na.omit(ROC(portfolioPrices))

portf <- portfolio.spec(colnames(portfolioReturns))

portf <- add.constraint(portf, type = 'weight_sum', min_sum = 0.99, max_sum = 1.01)
portf <- add.constraint(portf, type = "transaction_cost", ptc = 0.0075)
portf <- add.constraint(portf, type = 'box', min = .05, max = .20)
portf <- add.objective(portf, type = 'return', name = "mean")
portf <- add.objective(portf, type = 'risk', name = "StdDev", target = 0.005)

3.隨機產生此投資組合之10,000種可能性

set.seed(26736728)
rp <- random_portfolios(portf, 10000, "sample")

4.移動窗格法最佳化回測

每月更新投資組合比例,並以四個月作為最佳化訓練週期
opt_rebal <- optimize.portfolio.rebalancing(portfolioReturns,
                                            portf,
                                            optimize_method = "random",
                                            rp = rp,
                                            rebalance_on = "months",
                                            training_period = 1,
                                            rolling_window = 4)

opt_rebal
## **************************************************
## PortfolioAnalytics Optimization with Rebalancing
## **************************************************
## 
## Call:
## optimize.portfolio.rebalancing(R = portfolioReturns, portfolio = portf, 
##     optimize_method = "random", rp = rp, rebalance_on = "months", 
##     training_period = 1, rolling_window = 4)
## 
## Number of rebalancing dates:  25 
## First rebalance date:
## [1] "2020-06-30"
## Last rebalance date:
## [1] "2022-06-17"
## 
## Annualized Portfolio Rebalancing Return:
## [1] 0.1086059
## 
## Annualized Portfolio Standard Deviation:
## [1] 0.2775822

資產重新平衡次數:25次

投資組合年化報酬:10.86%

投資組合年化標準差:27.76%

視覺化投資組合每月變動比例
chart.Weights(opt_rebal, main = "Rebalanced Weights Over Time")

5.計算相同測試時間下以下三種資產之報酬

1.維持個股等比例的投資組合

2.S&P500

3.移動窗格法最佳化投資組合

# 等比例之投資組合報酬
equal_weight <- rep(1/ncol(portfolioReturns),ncol(portfolioReturns))
benchmark <- Return.portfolio(portfolioReturns,weights = equal_weight)
colnames(benchmark) <- "Benchmark Portfolio"

# S&P500報酬
sp500prices <- getSymbols.yahoo("SPY", from = "2020-06-19",to = "2022-06-19", periodicity = "daily", auto.assign = F)[,4]
sp500Rets <- na.omit(ROC(sp500prices))
sp500Rets <- as.xts(sp500Rets)

# 移動窗格法最佳化投資組合之報酬
rebal_weights <- extractWeights(opt_rebal)
rebal_returns <- Return.portfolio(portfolioReturns, weights = rebal_weights)

6.綜合績效評估

rets_df <- cbind(rebal_returns, benchmark, sp500Rets)
charts.PerformanceSummary(rets_df, main = "P/L Over Time")

四. 研究結論

可以看出我們的投資組合的累積報酬(黑色線)基本上都是勝過追蹤S&P500之ETF SPY(綠色線)的

雖然在獲利的時候未能勝過等比例的篩選出股票的投資組合(紅色線)

但在大盤開始下跌的時候,我們的投資組合較為抗跌

也因此可以看到在2021年末我們的投資組合成為了三種資產中報酬最好的選擇

可以推論出以下兩點

  • 我們的選股策略有效 (紅線與黑線一直都在綠線之上)

  • 我們的移動窗格最佳化有效,讓我們在下跌的時候起到更好的抗跌作用

透過這份研究,這樣的選股策略以及資產配置的調整方法或許可以運用在我們日常的基金管理。

rets_df <- cbind(rebal_returns, benchmark, sp500Rets)
charts.PerformanceSummary(rets_df, main = "P/L Over Time")

五. 附錄與其他

5.1 完整程式碼

此處程式碼為撰寫此份專案的工作R檔,因此排版方面未能做到最好,

但還是希望附上給有興趣的人參考。

library(rvest)
library(forcats)
library(plotly)
library(stringr)
library(purrr)
library(seminr)
# 架構圖
model <- relationships(
  paths(from = "SP500",to = c("Information Technology","Industrials","Financials","Health Care")),
  paths(from = "Information Technology", to = multi_items("Information_Technology_",1:2)),
  paths(from = "Industrials", to = multi_items("Industrials_",1:2)),
  paths(from = "Financials", to = multi_items("Financials_",1:2)),
  paths(from = "Health Care", to = multi_items("Health_Care_",1:2)),
  paths(from = multi_items("Information_Technology_",1:2),to ="Portfolio"),
  paths(from = multi_items("Industrials_",1:2), to = "Portfolio"),
  paths(from = multi_items("Financials_",1:2), to ="Portfolio"),
  paths(from = multi_items("Health_Care_",1:2), to ="Portfolio"),
  paths(from = "Portfolio",to = "Performance Analytics")
)
plot(model)
# Web-scrape SP500 stock list
sp_500 <- read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies") %>%
  html_node("table.wikitable") %>%
  html_table() %>%
  select("Symbol", Security, "GICS Sector", "GICS Sub-Industry") %>%
  as_tibble()
# Format names
names(sp_500) <- sp_500 %>% 
  names() %>% 
  make.names()
# Show results
sp_500 

# show in condensed format
sp_500 %>% 
  lapply(function(x) x %>% unique() %>% length()) %>%
  unlist() 

#Visualiztion industry in SP500
sp_500 %>%
  # Summarise data by frequency
  group_by(GICS.Sector) %>%
  summarise(count = n()) %>%
  # Visualize 
  ggplot(aes(x = GICS.Sector %>% fct_reorder(count),
             y = count
  )) + 
  geom_bar(stat = "identity") +
  geom_text(aes(label = count), size = 3, nudge_y = 4, nudge_x = .1) + 
  scale_y_continuous(limits = c(0,100)) +
  ggtitle(label = "Sector Frequency Among SP500 Stocks") +
  xlab(label = "GICS Sector") +
  theme(plot.title = element_text(size = 16)) + 
  coord_flip() 

# 選取前五大相關產業
sp_500_Information_Technology <- sp_500 %>% 
  filter(GICS.Sector=="Information Technology")
sp_500_Industrials <- sp_500 %>% 
  filter(GICS.Sector=="Industrials")
sp_500_Financials <- sp_500 %>% 
  filter(GICS.Sector=="Financials")
sp_500_Health_Care <- sp_500 %>% 
  filter(GICS.Sector=="Health Care")


library(quantmod)
library(tibble)

# Get Stock Prices Function
get_stock_prices <- function(ticker, return_format = "tibble", ...) {
  # Get stock prices
  stock_prices_xts <- getSymbols(Symbols = ticker, auto.assign = FALSE, ...)
  Time <- index(stock_prices_xts)
  # Rename
  names(stock_prices_xts) <- c("Open", "High", "Low", "Close", "Volume", "Adjusted")
  # Return in xts format if tibble is not specified
  if (return_format == "tibble") {
    stock_prices <- stock_prices_xts %>%
      as_tibble() %>%
      rownames_to_column(var = "Date") %>%
      mutate(Date = Time)
  } else {
    stock_prices <- stock_prices_xts
  }
  stock_prices
}

# Get Log Returns Function
get_log_returns <- function(x, return_format = "tibble", period = 'daily', ...) {
  Time <- x$Date
  # Convert tibble to xts
  if (!is.xts(x)) {
    x <- xts(x[,-1], order.by = x$Date)
  }
  # Get log returns
  log_returns_xts <- periodReturn(x = x$Adjusted, type = 'log', period = period, ...)
  # Rename
  names(log_returns_xts) <- "Log.Returns"
  # Return in xts format if tibble is not specified
  if (return_format == "tibble") {
    log_returns <- log_returns_xts %>%
      as_tibble() %>%
      rownames_to_column(var = "Date") %>%
      mutate(Date = Time)
  } else {
    log_returns <- log_returns_xts
  }
  log_returns
}


# Mapping the Functions to Information Technology Company
sp_500_Information_Technology <- sp_500_Information_Technology %>%
  mutate(
    stock.prices = map(Symbol, 
                       function(.x) get_stock_prices(.x, 
                                                     return_format = "tibble",
                                                     from = "2018-06-19",
                                                     to = "2020-06-19")
    ),
    log.returns  = map(stock.prices, 
                       function(.x) get_log_returns(.x, return_format = "tibble")),
    mean.log.returns = map_dbl(log.returns, ~ mean(.$Log.Returns)),
    sd.log.returns   = map_dbl(log.returns, ~ sd(.$Log.Returns)),
    n.trade.days = map_dbl(stock.prices, nrow)
  )  

# 觀察資料
sp_500_Information_Technology%>% 
  select(Symbol, stock.prices:log.returns)

# 其中的第一個資料裡的stock.prices的tibble(ACN)
sp_500_Information_Technology$stock.prices[[1]] 

# 選取我們需要使用的資料
sp_500_Information_Technology %>% 
  select(Symbol, mean.log.returns:n.trade.days) 

# Visualize S&P500_Information_Technology Analysis: Stock Risk vs Reward
plot_ly(data   = sp_500_Information_Technology,
        type   = "scatter",
        mode   = "markers",
        x      = ~ sd.log.returns,
        y      = ~ mean.log.returns,
        color  = ~ n.trade.days,
        colors = "Blues",
        size   = ~ n.trade.days,
        text   = ~ str_c("<em>", Security, "</em><br>",
                         "Ticker: ", Symbol, "<br>",
                         "Sub Sector: ", GICS.Sub.Industry, "<br>",
                         "No. of Trading Days: ", n.trade.days),
        marker = list(opacity = 0.8,
                      symbol = 'circle',
                      sizemode = 'diameter',
                      sizeref = 4.0,
                      line = list(width = 2, color = '#FFFFFF'))
) %>%
  layout(title   = 'S&P500_Information_Technology Analysis: Stock Risk vs Reward',
         xaxis   = list(title = 'Risk/Variability (StDev Log Returns)',
                        gridcolor = 'rgb(255, 255, 255)',
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwidth = 2),
         yaxis   = list(title = 'Reward/Growth (Mean Log Returns)',
                        gridcolor = 'rgb(255, 255, 255)',
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwith = 2),
         margin = list(l = 100,
                       t = 100,
                       b = 100),
         font   = list(color = '#FFFFFF'),
         paper_bgcolor = 'rgb(0, 0, 0)',
         plot_bgcolor = 'rgb(0, 0, 0)')

# 篩選標的
# METHOD1 : sd.log.returns < 0.03 and Highest mean.log.returns 
sp_500_Information_Technology_picked_1 <- sp_500_Information_Technology %>%
                                          filter(mean.log.returns >= 0.001,
                                                 sd.log.returns < 0.03) %>%
                                          select(Symbol, mean.log.returns:n.trade.days) %>%
                                          arrange(mean.log.returns %>% desc())
sp_500_Information_Technology_picked_1[1,]

# METHOD2 : Highest sd.log.returns 
sp_500_Information_Technology_picked_2 <- sp_500_Information_Technology %>%
                                          select(Symbol, mean.log.returns:n.trade.days) %>%
                                          arrange(sd.log.returns %>% desc())
sp_500_Information_Technology_picked_2[1,]

#  挑選標的

# NOW
as.character(sp_500_Information_Technology_picked_1[1,1]) %>%
  getSymbols(auto.assign = FALSE,
             from = "2018-06-19",
             to   = "2020-06-19") %>%
  chartSeries(name = as.character(sp_500_Information_Technology_picked_1[1,1]))

# ENPH
as.character(sp_500_Information_Technology_picked_2[1,1]) %>%
  getSymbols(auto.assign = FALSE,
             from = "2018-06-19",
             to   = "2020-06-19") %>%
  chartSeries(name = as.character(sp_500_Information_Technology_picked_2[1,1]))


  
# Mapping the Functions to Industrials Company
sp_500_Industrials <- sp_500_Industrials %>%
  mutate(
    stock.prices = map(Symbol, 
                       function(.x) get_stock_prices(.x, 
                                                     return_format = "tibble",
                                                     from = "2018-06-19",
                                                     to = "2020-06-19")
    ),
    log.returns  = map(stock.prices, 
                       function(.x) get_log_returns(.x, return_format = "tibble")),
    mean.log.returns = map_dbl(log.returns, ~ mean(.$Log.Returns)),
    sd.log.returns   = map_dbl(log.returns, ~ sd(.$Log.Returns)),
    n.trade.days = map_dbl(stock.prices, nrow)
  )  

# Visualize S&P500_Industrials Analysis: Stock Risk vs Reward
plot_ly(data   = sp_500_Industrials,
        type   = "scatter",
        mode   = "markers",
        x      = ~ sd.log.returns,
        y      = ~ mean.log.returns,
        color  = ~ n.trade.days,
        colors = "Blues",
        size   = ~ n.trade.days,
        text   = ~ str_c("<em>", Security, "</em><br>",
                         "Ticker: ", Symbol, "<br>",
                         "Sub Sector: ", GICS.Sub.Industry, "<br>",
                         "No. of Trading Days: ", n.trade.days),
        marker = list(opacity = 0.8,
                      symbol = 'circle',
                      sizemode = 'diameter',
                      sizeref = 4.0,
                      line = list(width = 2, color = '#FFFFFF'))
) %>%
  layout(title   = 'S&P500_Industrials Analysis: Stock Risk vs Reward',
         xaxis   = list(title = 'Risk/Variability (StDev Log Returns)',
                        gridcolor = 'rgb(255, 255, 255)',
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwidth = 2),
         yaxis   = list(title = 'Reward/Growth (Mean Log Returns)',
                        gridcolor = 'rgb(255, 255, 255)',
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwith = 2),
         margin = list(l = 100,
                       t = 100,
                       b = 100),
         font   = list(color = '#FFFFFF'),
         paper_bgcolor = 'rgb(0, 0, 0)',
         plot_bgcolor = 'rgb(0, 0, 0)')

# 篩選標的
# METHOD1 : sd.log.returns < 0.03 and Highest mean.log.returns 
sp_500_Industrials_picked_1 <- sp_500_Industrials %>%
  filter(mean.log.returns >= 0.001,
         sd.log.returns < 0.03) %>%
  select(Symbol, mean.log.returns:n.trade.days) %>%
  arrange(mean.log.returns %>% desc())
sp_500_Industrials_picked_1
sp_500_Industrials_picked_1[1,]

# METHOD2 : Highest sd.log.returns 
sp_500_Industrials_picked_2 <- sp_500_Industrials %>%
  select(Symbol, mean.log.returns:n.trade.days) %>%
  arrange(sd.log.returns %>% desc())
sp_500_Industrials_picked_2
sp_500_Industrials_picked_2[1,]

#  挑選標的

# GNRC
as.character(sp_500_Industrials_picked_1[1,1]) %>%
  getSymbols(auto.assign = FALSE,
             from = "2018-06-19",
             to   = "2020-06-19") %>%
  chartSeries(name = as.character(sp_500_Industrials_picked_1[1,1]))

# CARR
as.character(sp_500_Industrials_picked_2[1,1]) %>%
  getSymbols(auto.assign = FALSE,
             from = "2018-06-19",
             to   = "2020-06-19") %>%
  chartSeries(name = as.character(sp_500_Industrials_picked_2[1,1]))


# 把警告的資料排除問題
sp_500_Health_Care <- sp_500_Health_Care %>%
  filter(Symbol!= c("OGN"))

# Mapping the Functions to Health Care Company
sp_500_Health_Care <- sp_500_Health_Care %>%
  mutate(
    stock.prices = map(Symbol, 
                       function(.x) get_stock_prices(.x, 
                                                     return_format = "tibble",
                                                     from = "2018-06-19",
                                                     to = "2020-06-19")
    ),
    log.returns  = map(stock.prices, 
                       function(.x) get_log_returns(.x, return_format = "tibble")),
    mean.log.returns = map_dbl(log.returns, ~ mean(.$Log.Returns)),
    sd.log.returns   = map_dbl(log.returns, ~ sd(.$Log.Returns)),
    n.trade.days = map_dbl(stock.prices, nrow)
  )  

# Visualize S&P500_Health_Care Analysis: Stock Risk vs Reward
plot_ly(data   = sp_500_Health_Care,
        type   = "scatter",
        mode   = "markers",
        x      = ~ sd.log.returns,
        y      = ~ mean.log.returns,
        color  = ~ n.trade.days,
        colors = "Blues",
        size   = ~ n.trade.days,
        text   = ~ str_c("<em>", Security, "</em><br>",
                         "Ticker: ", Symbol, "<br>",
                         "Sub Sector: ", GICS.Sub.Industry, "<br>",
                         "No. of Trading Days: ", n.trade.days),
        marker = list(opacity = 0.8,
                      symbol = 'circle',
                      sizemode = 'diameter',
                      sizeref = 4.0,
                      line = list(width = 2, color = '#FFFFFF'))
) %>%
  layout(title   = 'S&P500_Health_Care Analysis: Stock Risk vs Reward',
         xaxis   = list(title = 'Risk/Variability (StDev Log Returns)',
                        gridcolor = 'rgb(255, 255, 255)',
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwidth = 2),
         yaxis   = list(title = 'Reward/Growth (Mean Log Returns)',
                        gridcolor = 'rgb(255, 255, 255)',
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwith = 2),
         margin = list(l = 100,
                       t = 100,
                       b = 100),
         font   = list(color = '#FFFFFF'),
         paper_bgcolor = 'rgb(0, 0, 0)',
         plot_bgcolor = 'rgb(0, 0, 0)')

# 篩選標的
# METHOD1 : sd.log.returns < 0.03 and Highest mean.log.returns 
sp_500_Health_Care_picked_1 <- sp_500_Health_Care %>%
  filter(mean.log.returns >= 0.001,
         sd.log.returns < 0.03) %>%
  select(Symbol, mean.log.returns:n.trade.days) %>%
  arrange(mean.log.returns %>% desc())
sp_500_Health_Care_picked_1
sp_500_Health_Care_picked_1[1,]

# METHOD2 : Highest sd.log.returns 
sp_500_Health_Care_picked_2 <- sp_500_Health_Care %>%
  select(Symbol, mean.log.returns:n.trade.days) %>%
  arrange(sd.log.returns %>% desc())
sp_500_Health_Care_picked_2
sp_500_Health_Care_picked_2[1,]

#  挑選標的

# WST
as.character(sp_500_Health_Care_picked_1[1,1]) %>%
  getSymbols(auto.assign = FALSE,
             from = "2018-06-19",
             to   = "2020-06-19") %>%
  chartSeries(name = as.character(sp_500_Health_Care_picked_1[1,1]))

# MRNA
as.character(sp_500_Health_Care_picked_2[1,1]) %>%
  getSymbols(auto.assign = FALSE,
             from = "2018-06-19",
             to   = "2020-06-19") %>%
  chartSeries(name = as.character(sp_500_Health_Care_picked_2[1,1]))

# 把警告的資料排除問題
sp_500_Financials <- sp_500_Financials %>%
  filter(Symbol!= c("BRK.B"))

# Mapping the Functions to Financials Company
sp_500_Financials <- sp_500_Financials %>%
  mutate(
    stock.prices = map(Symbol, 
                       function(.x) get_stock_prices(.x, 
                                                     return_format = "tibble",
                                                     from = "2018-06-19",
                                                     to = "2020-06-19")
    ),
    log.returns  = map(stock.prices, 
                       function(.x) get_log_returns(.x, return_format = "tibble")),
    mean.log.returns = map_dbl(log.returns, ~ mean(.$Log.Returns)),
    sd.log.returns   = map_dbl(log.returns, ~ sd(.$Log.Returns)),
    n.trade.days = map_dbl(stock.prices, nrow)
  )  

# Visualize S&P500_Financials Analysis: Stock Risk vs Reward
plot_ly(data   = sp_500_Financials,
        type   = "scatter",
        mode   = "markers",
        x      = ~ sd.log.returns,
        y      = ~ mean.log.returns,
        color  = ~ n.trade.days,
        colors = "Blues",
        size   = ~ n.trade.days,
        text   = ~ str_c("<em>", Security, "</em><br>",
                         "Ticker: ", Symbol, "<br>",
                         "Sub Sector: ", GICS.Sub.Industry, "<br>",
                         "No. of Trading Days: ", n.trade.days),
        marker = list(opacity = 0.8,
                      symbol = 'circle',
                      sizemode = 'diameter',
                      sizeref = 4.0,
                      line = list(width = 2, color = '#FFFFFF'))
) %>%
  layout(title   = 'S&P500_Financials Analysis: Stock Risk vs Reward',
         xaxis   = list(title = 'Risk/Variability (StDev Log Returns)',
                        gridcolor = 'rgb(255, 255, 255)',
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwidth = 2),
         yaxis   = list(title = 'Reward/Growth (Mean Log Returns)',
                        gridcolor = 'rgb(255, 255, 255)',
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwith = 2),
         margin = list(l = 100,
                       t = 100,
                       b = 100),
         font   = list(color = '#FFFFFF'),
         paper_bgcolor = 'rgb(0, 0, 0)',
         plot_bgcolor = 'rgb(0, 0, 0)')

# 篩選標的
# METHOD1 : sd.log.returns < 0.03 and Highest mean.log.returns 
sp_500_Financials_picked_1 <- sp_500_Financials %>%
  filter(mean.log.returns >= 0.001,
         sd.log.returns < 0.03) %>%
  select(Symbol, mean.log.returns:n.trade.days) %>%
  arrange(mean.log.returns %>% desc())
sp_500_Financials_picked_1
sp_500_Financials_picked_1[1,]

# METHOD2 : Highest sd.log.returns 
sp_500_Financials_picked_2 <- sp_500_Financials %>%
  select(Symbol, mean.log.returns:n.trade.days) %>%
  arrange(sd.log.returns %>% desc())
sp_500_Financials_picked_2
sp_500_Financials_picked_2[1,]

#  挑選標的

# MKTX
as.character(sp_500_Financials_picked_1[1,1]) %>%
  getSymbols(auto.assign = FALSE,
             from = "2018-06-19",
             to   = "2020-06-19") %>%
  chartSeries(name = as.character(sp_500_Financials_picked_1[1,1]))

# LNC
as.character(sp_500_Financials_picked_2[1,1]) %>%
  getSymbols(auto.assign = FALSE,
             from = "2018-06-19",
             to   = "2020-06-19") %>%
  chartSeries(name = as.character(sp_500_Financials_picked_2[1,1]))

# 最終篩選結果
results <-c(sp_500_Information_Technology_picked_1[1,1],sp_500_Information_Technology_picked_2[1,1],
            sp_500_Industrials_picked_1[1,1],sp_500_Industrials_picked_2[1,1],
            sp_500_Financials_picked_1[1,1],sp_500_Financials_picked_2[1,1],
            sp_500_Health_Care_picked_1[1,1],sp_500_Health_Care_picked_2[1,1]) %>%
          as.character()

results

library(rvest)
library(forcats)
library(plotly)
library(stringr)
library(purrr)
library(seminr)
# 架構圖
model <- relationships(
  paths(from = "SP500",to = c("Information Technology","Industrials","Financials","Health Care")),
  paths(from = "Information Technology", to = multi_items("Information_Technology_",1:2)),
  paths(from = "Industrials", to = multi_items("Industrials_",1:2)),
  paths(from = "Financials", to = multi_items("Financials_",1:2)),
  paths(from = "Health Care", to = multi_items("Health_Care_",1:2)),
  paths(from = multi_items("Information_Technology_",1:2),to ="Portfolio"),
  paths(from = multi_items("Industrials_",1:2), to = "Portfolio"),
  paths(from = multi_items("Financials_",1:2), to ="Portfolio"),
  paths(from = multi_items("Health_Care_",1:2), to ="Portfolio"),
  paths(from = "Portfolio",to = "Performance Analytics")
)
plot(model)
# Web-scrape SP500 stock list
sp_500 <- read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies") %>%
  html_node("table.wikitable") %>%
  html_table() %>%
  select("Symbol", Security, "GICS Sector", "GICS Sub-Industry") %>%
  as_tibble()
# Format names
names(sp_500) <- sp_500 %>% 
  names() %>% 
  make.names()
# Show results
sp_500 

# show in condensed format
sp_500 %>% 
  lapply(function(x) x %>% unique() %>% length()) %>%
  unlist() 

#Visualiztion industry in SP500
sp_500 %>%
  # Summarise data by frequency
  group_by(GICS.Sector) %>%
  summarise(count = n()) %>%
  # Visualize 
  ggplot(aes(x = GICS.Sector %>% fct_reorder(count),
             y = count
  )) + 
  geom_bar(stat = "identity") +
  geom_text(aes(label = count), size = 3, nudge_y = 4, nudge_x = .1) + 
  scale_y_continuous(limits = c(0,100)) +
  ggtitle(label = "Sector Frequency Among SP500 Stocks") +
  xlab(label = "GICS Sector") +
  theme(plot.title = element_text(size = 16)) + 
  coord_flip() 

# 選取前五大相關產業
sp_500_Information_Technology <- sp_500 %>% 
  filter(GICS.Sector=="Information Technology")
sp_500_Industrials <- sp_500 %>% 
  filter(GICS.Sector=="Industrials")
sp_500_Financials <- sp_500 %>% 
  filter(GICS.Sector=="Financials")
sp_500_Health_Care <- sp_500 %>% 
  filter(GICS.Sector=="Health Care")


library(quantmod)
library(tibble)

# Get Stock Prices Function
get_stock_prices <- function(ticker, return_format = "tibble", ...) {
  # Get stock prices
  stock_prices_xts <- getSymbols(Symbols = ticker, auto.assign = FALSE, ...)
  Time <- index(stock_prices_xts)
  # Rename
  names(stock_prices_xts) <- c("Open", "High", "Low", "Close", "Volume", "Adjusted")
  # Return in xts format if tibble is not specified
  if (return_format == "tibble") {
    stock_prices <- stock_prices_xts %>%
      as_tibble() %>%
      rownames_to_column(var = "Date") %>%
      mutate(Date = Time)
  } else {
    stock_prices <- stock_prices_xts
  }
  stock_prices
}

# Get Log Returns Function
get_log_returns <- function(x, return_format = "tibble", period = 'daily', ...) {
  Time <- x$Date
  # Convert tibble to xts
  if (!is.xts(x)) {
    x <- xts(x[,-1], order.by = x$Date)
  }
  # Get log returns
  log_returns_xts <- periodReturn(x = x$Adjusted, type = 'log', period = period, ...)
  # Rename
  names(log_returns_xts) <- "Log.Returns"
  # Return in xts format if tibble is not specified
  if (return_format == "tibble") {
    log_returns <- log_returns_xts %>%
      as_tibble() %>%
      rownames_to_column(var = "Date") %>%
      mutate(Date = Time)
  } else {
    log_returns <- log_returns_xts
  }
  log_returns
}


# Mapping the Functions to Information Technology Company
sp_500_Information_Technology <- sp_500_Information_Technology %>%
  mutate(
    stock.prices = map(Symbol, 
                       function(.x) get_stock_prices(.x, 
                                                     return_format = "tibble",
                                                     from = "2018-06-19",
                                                     to = "2020-06-19")
    ),
    log.returns  = map(stock.prices, 
                       function(.x) get_log_returns(.x, return_format = "tibble")),
    mean.log.returns = map_dbl(log.returns, ~ mean(.$Log.Returns)),
    sd.log.returns   = map_dbl(log.returns, ~ sd(.$Log.Returns)),
    n.trade.days = map_dbl(stock.prices, nrow)
  )  

# 觀察資料
sp_500_Information_Technology%>% 
  select(Symbol, stock.prices:log.returns)

# 其中的第一個資料裡的stock.prices的tibble(ACN)
sp_500_Information_Technology$stock.prices[[1]] 

# 選取我們需要使用的資料
sp_500_Information_Technology %>% 
  select(Symbol, mean.log.returns:n.trade.days) 

# Visualize S&P500_Information_Technology Analysis: Stock Risk vs Reward
plot_ly(data   = sp_500_Information_Technology,
        type   = "scatter",
        mode   = "markers",
        x      = ~ sd.log.returns,
        y      = ~ mean.log.returns,
        color  = ~ n.trade.days,
        colors = "Blues",
        size   = ~ n.trade.days,
        text   = ~ str_c("<em>", Security, "</em><br>",
                         "Ticker: ", Symbol, "<br>",
                         "Sub Sector: ", GICS.Sub.Industry, "<br>",
                         "No. of Trading Days: ", n.trade.days),
        marker = list(opacity = 0.8,
                      symbol = 'circle',
                      sizemode = 'diameter',
                      sizeref = 4.0,
                      line = list(width = 2, color = '#FFFFFF'))
) %>%
  layout(title   = 'S&P500_Information_Technology Analysis: Stock Risk vs Reward',
         xaxis   = list(title = 'Risk/Variability (StDev Log Returns)',
                        gridcolor = 'rgb(255, 255, 255)',
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwidth = 2),
         yaxis   = list(title = 'Reward/Growth (Mean Log Returns)',
                        gridcolor = 'rgb(255, 255, 255)',
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwith = 2),
         margin = list(l = 100,
                       t = 100,
                       b = 100),
         font   = list(color = '#FFFFFF'),
         paper_bgcolor = 'rgb(0, 0, 0)',
         plot_bgcolor = 'rgb(0, 0, 0)')

# 篩選標的
# METHOD1 : sd.log.returns < 0.03 and Highest mean.log.returns 
sp_500_Information_Technology_picked_1 <- sp_500_Information_Technology %>%
                                          filter(mean.log.returns >= 0.001,
                                                 sd.log.returns < 0.03) %>%
                                          select(Symbol, mean.log.returns:n.trade.days) %>%
                                          arrange(mean.log.returns %>% desc())
sp_500_Information_Technology_picked_1[1,]

# METHOD2 : Highest sd.log.returns 
sp_500_Information_Technology_picked_2 <- sp_500_Information_Technology %>%
                                          select(Symbol, mean.log.returns:n.trade.days) %>%
                                          arrange(sd.log.returns %>% desc())
sp_500_Information_Technology_picked_2[1,]

#  挑選標的

# NOW
as.character(sp_500_Information_Technology_picked_1[1,1]) %>%
  getSymbols(auto.assign = FALSE,
             from = "2018-06-19",
             to   = "2020-06-19") %>%
  chartSeries(name = as.character(sp_500_Information_Technology_picked_1[1,1]))

# ENPH
as.character(sp_500_Information_Technology_picked_2[1,1]) %>%
  getSymbols(auto.assign = FALSE,
             from = "2018-06-19",
             to   = "2020-06-19") %>%
  chartSeries(name = as.character(sp_500_Information_Technology_picked_2[1,1]))


  
# Mapping the Functions to Industrials Company
sp_500_Industrials <- sp_500_Industrials %>%
  mutate(
    stock.prices = map(Symbol, 
                       function(.x) get_stock_prices(.x, 
                                                     return_format = "tibble",
                                                     from = "2018-06-19",
                                                     to = "2020-06-19")
    ),
    log.returns  = map(stock.prices, 
                       function(.x) get_log_returns(.x, return_format = "tibble")),
    mean.log.returns = map_dbl(log.returns, ~ mean(.$Log.Returns)),
    sd.log.returns   = map_dbl(log.returns, ~ sd(.$Log.Returns)),
    n.trade.days = map_dbl(stock.prices, nrow)
  )  

# Visualize S&P500_Industrials Analysis: Stock Risk vs Reward
plot_ly(data   = sp_500_Industrials,
        type   = "scatter",
        mode   = "markers",
        x      = ~ sd.log.returns,
        y      = ~ mean.log.returns,
        color  = ~ n.trade.days,
        colors = "Blues",
        size   = ~ n.trade.days,
        text   = ~ str_c("<em>", Security, "</em><br>",
                         "Ticker: ", Symbol, "<br>",
                         "Sub Sector: ", GICS.Sub.Industry, "<br>",
                         "No. of Trading Days: ", n.trade.days),
        marker = list(opacity = 0.8,
                      symbol = 'circle',
                      sizemode = 'diameter',
                      sizeref = 4.0,
                      line = list(width = 2, color = '#FFFFFF'))
) %>%
  layout(title   = 'S&P500_Industrials Analysis: Stock Risk vs Reward',
         xaxis   = list(title = 'Risk/Variability (StDev Log Returns)',
                        gridcolor = 'rgb(255, 255, 255)',
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwidth = 2),
         yaxis   = list(title = 'Reward/Growth (Mean Log Returns)',
                        gridcolor = 'rgb(255, 255, 255)',
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwith = 2),
         margin = list(l = 100,
                       t = 100,
                       b = 100),
         font   = list(color = '#FFFFFF'),
         paper_bgcolor = 'rgb(0, 0, 0)',
         plot_bgcolor = 'rgb(0, 0, 0)')

# 篩選標的
# METHOD1 : sd.log.returns < 0.03 and Highest mean.log.returns 
sp_500_Industrials_picked_1 <- sp_500_Industrials %>%
  filter(mean.log.returns >= 0.001,
         sd.log.returns < 0.03) %>%
  select(Symbol, mean.log.returns:n.trade.days) %>%
  arrange(mean.log.returns %>% desc())
sp_500_Industrials_picked_1
sp_500_Industrials_picked_1[1,]

# METHOD2 : Highest sd.log.returns 
sp_500_Industrials_picked_2 <- sp_500_Industrials %>%
  select(Symbol, mean.log.returns:n.trade.days) %>%
  arrange(sd.log.returns %>% desc())
sp_500_Industrials_picked_2
sp_500_Industrials_picked_2[1,]

#  挑選標的

# GNRC
as.character(sp_500_Industrials_picked_1[1,1]) %>%
  getSymbols(auto.assign = FALSE,
             from = "2018-06-19",
             to   = "2020-06-19") %>%
  chartSeries(name = as.character(sp_500_Industrials_picked_1[1,1]))

# CARR
as.character(sp_500_Industrials_picked_2[1,1]) %>%
  getSymbols(auto.assign = FALSE,
             from = "2018-06-19",
             to   = "2020-06-19") %>%
  chartSeries(name = as.character(sp_500_Industrials_picked_2[1,1]))


# 把警告的資料排除問題
sp_500_Health_Care <- sp_500_Health_Care %>%
  filter(Symbol!= c("OGN"))

# Mapping the Functions to Health Care Company
sp_500_Health_Care <- sp_500_Health_Care %>%
  mutate(
    stock.prices = map(Symbol, 
                       function(.x) get_stock_prices(.x, 
                                                     return_format = "tibble",
                                                     from = "2018-06-19",
                                                     to = "2020-06-19")
    ),
    log.returns  = map(stock.prices, 
                       function(.x) get_log_returns(.x, return_format = "tibble")),
    mean.log.returns = map_dbl(log.returns, ~ mean(.$Log.Returns)),
    sd.log.returns   = map_dbl(log.returns, ~ sd(.$Log.Returns)),
    n.trade.days = map_dbl(stock.prices, nrow)
  )  

# Visualize S&P500_Health_Care Analysis: Stock Risk vs Reward
plot_ly(data   = sp_500_Health_Care,
        type   = "scatter",
        mode   = "markers",
        x      = ~ sd.log.returns,
        y      = ~ mean.log.returns,
        color  = ~ n.trade.days,
        colors = "Blues",
        size   = ~ n.trade.days,
        text   = ~ str_c("<em>", Security, "</em><br>",
                         "Ticker: ", Symbol, "<br>",
                         "Sub Sector: ", GICS.Sub.Industry, "<br>",
                         "No. of Trading Days: ", n.trade.days),
        marker = list(opacity = 0.8,
                      symbol = 'circle',
                      sizemode = 'diameter',
                      sizeref = 4.0,
                      line = list(width = 2, color = '#FFFFFF'))
) %>%
  layout(title   = 'S&P500_Health_Care Analysis: Stock Risk vs Reward',
         xaxis   = list(title = 'Risk/Variability (StDev Log Returns)',
                        gridcolor = 'rgb(255, 255, 255)',
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwidth = 2),
         yaxis   = list(title = 'Reward/Growth (Mean Log Returns)',
                        gridcolor = 'rgb(255, 255, 255)',
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwith = 2),
         margin = list(l = 100,
                       t = 100,
                       b = 100),
         font   = list(color = '#FFFFFF'),
         paper_bgcolor = 'rgb(0, 0, 0)',
         plot_bgcolor = 'rgb(0, 0, 0)')

# 篩選標的
# METHOD1 : sd.log.returns < 0.03 and Highest mean.log.returns 
sp_500_Health_Care_picked_1 <- sp_500_Health_Care %>%
  filter(mean.log.returns >= 0.001,
         sd.log.returns < 0.03) %>%
  select(Symbol, mean.log.returns:n.trade.days) %>%
  arrange(mean.log.returns %>% desc())
sp_500_Health_Care_picked_1
sp_500_Health_Care_picked_1[1,]

# METHOD2 : Highest sd.log.returns 
sp_500_Health_Care_picked_2 <- sp_500_Health_Care %>%
  select(Symbol, mean.log.returns:n.trade.days) %>%
  arrange(sd.log.returns %>% desc())
sp_500_Health_Care_picked_2
sp_500_Health_Care_picked_2[1,]

#  挑選標的

# WST
as.character(sp_500_Health_Care_picked_1[1,1]) %>%
  getSymbols(auto.assign = FALSE,
             from = "2018-06-19",
             to   = "2020-06-19") %>%
  chartSeries(name = as.character(sp_500_Health_Care_picked_1[1,1]))

# MRNA
as.character(sp_500_Health_Care_picked_2[1,1]) %>%
  getSymbols(auto.assign = FALSE,
             from = "2018-06-19",
             to   = "2020-06-19") %>%
  chartSeries(name = as.character(sp_500_Health_Care_picked_2[1,1]))

# 把警告的資料排除問題
sp_500_Financials <- sp_500_Financials %>%
  filter(Symbol!= c("BRK.B"))

# Mapping the Functions to Financials Company
sp_500_Financials <- sp_500_Financials %>%
  mutate(
    stock.prices = map(Symbol, 
                       function(.x) get_stock_prices(.x, 
                                                     return_format = "tibble",
                                                     from = "2018-06-19",
                                                     to = "2020-06-19")
    ),
    log.returns  = map(stock.prices, 
                       function(.x) get_log_returns(.x, return_format = "tibble")),
    mean.log.returns = map_dbl(log.returns, ~ mean(.$Log.Returns)),
    sd.log.returns   = map_dbl(log.returns, ~ sd(.$Log.Returns)),
    n.trade.days = map_dbl(stock.prices, nrow)
  )  

# Visualize S&P500_Financials Analysis: Stock Risk vs Reward
plot_ly(data   = sp_500_Financials,
        type   = "scatter",
        mode   = "markers",
        x      = ~ sd.log.returns,
        y      = ~ mean.log.returns,
        color  = ~ n.trade.days,
        colors = "Blues",
        size   = ~ n.trade.days,
        text   = ~ str_c("<em>", Security, "</em><br>",
                         "Ticker: ", Symbol, "<br>",
                         "Sub Sector: ", GICS.Sub.Industry, "<br>",
                         "No. of Trading Days: ", n.trade.days),
        marker = list(opacity = 0.8,
                      symbol = 'circle',
                      sizemode = 'diameter',
                      sizeref = 4.0,
                      line = list(width = 2, color = '#FFFFFF'))
) %>%
  layout(title   = 'S&P500_Financials Analysis: Stock Risk vs Reward',
         xaxis   = list(title = 'Risk/Variability (StDev Log Returns)',
                        gridcolor = 'rgb(255, 255, 255)',
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwidth = 2),
         yaxis   = list(title = 'Reward/Growth (Mean Log Returns)',
                        gridcolor = 'rgb(255, 255, 255)',
                        zerolinewidth = 1,
                        ticklen = 5,
                        gridwith = 2),
         margin = list(l = 100,
                       t = 100,
                       b = 100),
         font   = list(color = '#FFFFFF'),
         paper_bgcolor = 'rgb(0, 0, 0)',
         plot_bgcolor = 'rgb(0, 0, 0)')

# 篩選標的
# METHOD1 : sd.log.returns < 0.03 and Highest mean.log.returns 
sp_500_Financials_picked_1 <- sp_500_Financials %>%
  filter(mean.log.returns >= 0.001,
         sd.log.returns < 0.03) %>%
  select(Symbol, mean.log.returns:n.trade.days) %>%
  arrange(mean.log.returns %>% desc())
sp_500_Financials_picked_1
sp_500_Financials_picked_1[1,]

# METHOD2 : Highest sd.log.returns 
sp_500_Financials_picked_2 <- sp_500_Financials %>%
  select(Symbol, mean.log.returns:n.trade.days) %>%
  arrange(sd.log.returns %>% desc())
sp_500_Financials_picked_2
sp_500_Financials_picked_2[1,]

#  挑選標的

# MKTX
as.character(sp_500_Financials_picked_1[1,1]) %>%
  getSymbols(auto.assign = FALSE,
             from = "2018-06-19",
             to   = "2020-06-19") %>%
  chartSeries(name = as.character(sp_500_Financials_picked_1[1,1]))

# LNC
as.character(sp_500_Financials_picked_2[1,1]) %>%
  getSymbols(auto.assign = FALSE,
             from = "2018-06-19",
             to   = "2020-06-19") %>%
  chartSeries(name = as.character(sp_500_Financials_picked_2[1,1]))

# 最終篩選結果
results <-c(sp_500_Information_Technology_picked_1[1,1],sp_500_Information_Technology_picked_2[1,1],
            sp_500_Industrials_picked_1[1,1],sp_500_Industrials_picked_2[1,1],
            sp_500_Financials_picked_1[1,1],sp_500_Financials_picked_2[1,1],
            sp_500_Health_Care_picked_1[1,1],sp_500_Health_Care_picked_2[1,1]) %>%
          as.character()

results

library(PortfolioAnalytics)
library(PerformanceAnalytics)

tickers <- results

portfolioPrices <- NULL
for (ticker in tickers) {
  portfolioPrices <- cbind(portfolioPrices,
                           getSymbols.yahoo(ticker, from = "2020-06-19",to = "2022-06-19", periodicity = "daily", auto.assign = F)[,4])
}

portfolioReturns <- na.omit(ROC(portfolioPrices))

portf <- portfolio.spec(colnames(portfolioReturns))

portf <- add.constraint(portf, type = 'weight_sum', min_sum = 0.99, max_sum = 1.01)
portf <- add.constraint(portf, type = "transaction_cost", ptc = 0.0075)
portf <- add.constraint(portf, type = 'box', min = .05, max = .20)
portf <- add.objective(portf, type = 'return', name = "mean")
portf <- add.objective(portf, type = 'risk', name = "StdDev", target = 0.005)

rp <- random_portfolios(portf, 10000, "sample")

opt_rebal <- optimize.portfolio.rebalancing(portfolioReturns,
                                            portf,
                                            optimize_method = "random",
                                            rp = rp,
                                            rebalance_on = "months",
                                            training_period = 1,
                                            rolling_window = 4)

opt_rebal

chart.Weights(opt_rebal, main = "Rebalanced Weights Over Time")

equal_weight <- rep(1/ncol(portfolioReturns),ncol(portfolioReturns))
benchmark <- Return.portfolio(portfolioReturns,weights = equal_weight)
colnames(benchmark) <- "Benchmark Portfolio"

sp500prices <- getSymbols.yahoo("SPY", from = "2020-06-19",to = "2022-06-19", periodicity = "daily", auto.assign = F)[,4]
sp500Rets <- na.omit(ROC(sp500prices))
sp500Rets <- as.xts(sp500Rets)

rebal_weights <- extractWeights(opt_rebal)
rebal_returns <- Return.portfolio(portfolioReturns, weights = rebal_weights)

rets_df <- cbind(rebal_returns, benchmark, sp500Rets)

charts.PerformanceSummary(rets_df, main = "P/L Over Time")

5.2 結語

從哪裡得到專案的想法

之前在投資學課堂上,有做一個模擬投資,那時的規則是要管理1,000萬台幣的資金,

那個作業後我意識到當管理資金規模變大時,便不能隨意地操作,因為手續費可能會成為很大的支出,

而在這過程中也發現,不論是基本面分析或技術分析,都逃離不了股價實際表現出來的樣態,

因此在製作此份專案時,便想到如果可以利用股價資料的表現,來作為選股標準,

並利用之前上過金融交易實務課堂學到的移動窗格化回測去測驗自製的投資組合績效,或許會是一個好主題!

選取S&P500作為比較對象是因為之前有讀過關於巴菲特十年之約的文章,

覺得如果能夠打敗巴菲特的想法或許是一個很有趣的方向,便產生了這份專案。

使用了哪些R的技術

利用R語言實作網路爬蟲,對時間序列加以分析以及對資料進行分類,並繪製可視化圖表讓資料視覺化,

並實作了移動窗格最佳化回測,以及評估投資組合績效表現。

使用了哪些新的套件與函數

rvest

rvest 套件擷取任意的網頁資料,並將大量結構化的資料直接匯入 R 中,

讓資料分析者可以省去手動整理資料的時間。

DT

可將所有的數據呈現為交互式數據表,可自行選擇每次呈現數量的頻率為何。

purrr

可將資料整理成需要的形式

plotly

plotly 的 R 圖形庫可以製作交互式、互動式的圖形

在此處可以更方便讀者自行操作。

PortfolioAnalytics

此套件幫助我們設定投資組合限定與目標條件,

並可以自行設定參數實現移動窗格法最佳化。

PerformanceAnalytics

此套件幫助我們評估投資組合之績效,以及報酬率圖表可視化

讓我們可以有效分析策略是否有效。

上課所教的dplyr, ggplot2

上課教的這些套件,是我覺得最實用的,因為在整理資料的過程如果只用傳統命令變數的方法,程式碼會變得複雜無比,

而dplyr讓我可以更快速且精準的整理時間序列資料成所需形式,

至於ggplot2 讓我可以更好的呈現資料想要的視覺化。

最困難的部分

網路爬蟲

在利用R進行線上爬取股價資料的過程中,需要耗費很久的時間且常會有阻擋的問題,

所以需要更換ip位置或是分批爬取資料導致分析過程很冗長。

資料自動化處理

撰寫爬取函數和log return函數時,因為這次處理的資料組成很複雜,

所以花了很多心思在思考如何整理資料成需要的形式

新函數的使用

因為想要實作移動窗格法的關係,花了很多時間研究 PortfolioAnalytics 套件,

以及搞懂需要資料的格式以及參數設置的背後理論基礎。

網站設計與專案呈現

為了讓使用者有更好的觀看體驗,嘗試了許多Rmarkdown的功能以及設計,

希望使用的各位都能很享受這份報告的視覺以及表達。

結語

這學期花了很多時間在學習R語言的使用,從一開始薄弱的程式基礎到現在可以獨立撰寫一份專案,

特別感謝鍾經樊教授的課程指導以及專案建議,張雯晴助教的疑問解答的幫忙

最後在此放上本篇專案的參考資料,這些網路上的資源幫助我許多。

— 計財23劉冠宇